Poisson Reads Model

In the reads model, we have the following representation

\[ c_{ng} \sim Poi \left ( exp \left (\sum_{k=1}^{K} \alpha_{kg} \omega^{(m)}_{nk} + \beta_{b(n):g} +\epsilon_{ng} \right ) \right) \hspace{1 cm} \forall n, \hspace{0.2 cm} g \; \text{fixed} \]

where we assume that \(\beta_{b:g} \sim N(0,{\sigma_b}^2)\) represents the random effect and the term \(\epsilon_{ng} \sim N(0,{\sigma_e}^2)\) represents the overdispersion in the model.

Our aim is to estimate each of the terms \(\sigma_e\), \(\sigma_b\), \(\alpha\), \(\beta\) (the BLUPs) and the topic proportions or cluster proportions \(\omega\).

To ensure faster convergence, it is essential to start with good initial values that are believed to be close to the actual parameters. A problem like this, involving so many parameters, also makes it challenging to reach the global optimum of the underlying optimization problem and to pinpoint the correct estimates. So we briefly go over the initialization mechanism we have used for the reads model.

Initialization

  • For this model, we only needed to initialize the topic proportion vectors \(\omega\)’s suitably. We started with \(\omega\) following

\[ \omega_{n.} \sim Dir(\frac{\delta}{K}, \frac{\delta}{K}, \cdots, \frac{\delta}{K}) \]

where \(\delta=3\) is the scaling parameter and \(K=4\) is the number of topics.

We first consider \(log(c_{ng}+\epsilon)\), for a small \(\epsilon > 0\), and fit the model

\[ log(c_{ng}+\epsilon) = \sum_{k=1}^{K} \omega_{nk}\alpha_{kg} + \beta_{b(n):g} +\epsilon_{ng} \]

using the lmer function and running it for 100 iterations, to obtain estimates of the \(\omega_{nk}\), which we then denote as the \(0\)th iterate \(\omega^{(0)}_{nk}\).
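As a minimal sketch of the initialization (in Python/NumPy rather than the R used elsewhere in this document; the variable names are ours), each row of the initial \(\omega\) matrix is one Dirichlet draw:

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 500, 4   # number of cells and number of topics, as in the simulations below
delta = 3       # Dirichlet scaling parameter

# Each row of omega0 is one draw from Dir(delta/K, ..., delta/K),
# so every row is a probability vector over the K topics.
omega0 = rng.dirichlet(np.full(K, delta / K), size=N)
```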

Estimation

\(\alpha_{kg}\) is the coefficient of the regressor \(\omega^{(m)}_{\star,k}\), so for each \(g\) this is a Generalized Linear Mixed Model (GLMM) with \(K\) regressors given by the \(\omega^{(m)}_{nk}\) and two random effects (one representing the batch effect and the other the overdispersion). We solve this using the glmer function of the lme4 package in R with family poisson(), and thus obtain estimates of the regression coefficients \(\alpha_{kg}\) and the random effects \(\beta_{b(n):g}\) and \(\epsilon_{ng}\). We call these estimated values \(\alpha^{(m)}_{kg}\), \(\beta^{(m)}_{bg}\) and \(\epsilon^{(m)}_{ng}\) for \(b=1,2,\cdots,B\), \(k=1,2,\cdots,K\) and \(g=1,2,\cdots,G\). Given these estimates we can determine \(\omega^{(m+1)}_{nk}\) as follows:

\[ v^{(m+1)}_{n,\star} = \underset{v_{n,\star}\; \forall n}{argmin} \sum_{n=1}^{N}\sum_{g=1}^{G} \left [ exp \left (\sum_{k=1}^{K} f_{nk}(v_{n,\star}) \alpha^{(m)}_{kg} + \beta^{(m)}_{b(n):g} + \epsilon^{(m)}_{ng}\right ) - c_{ng} \left (\sum_{k=1}^{K} f_{nk}(v_{n,\star}) \alpha^{(m)}_{kg} + \beta^{(m)}_{b(n):g} + \epsilon^{(m)}_{ng} \right ) \right ] \]

\[ \omega^{(m+1)}_{nk} = f_{nk} (v^{(m+1)}_{n,\star}) \]

where \(v_{n,\star}\) is unconstrained and takes values on the entire real line, with \(f_{nk}\) mapping it back to the topic proportions.
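The \(\omega\)-update above can be sketched as follows (in Python/NumPy with scipy, not the R used in this document). Here we take \(f_{nk}\) to be the softmax map, which is one natural choice of function carrying an unconstrained vector to the probability simplex; the problem sizes and variable names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, G, K = 5, 20, 3                            # small illustrative sizes
alpha = rng.normal(size=(K, G))               # stand-in for alpha^(m)
offset = rng.normal(scale=0.1, size=(N, G))   # stand-in for beta^(m)_{b(n):g} + eps^(m)_{ng}
counts = rng.poisson(3.0, size=(N, G))        # stand-in for c_{ng}

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def objective(v, n):
    # Poisson negative log-likelihood for cell n (up to an additive
    # constant), as a function of the unconstrained vector v.
    eta = softmax(v) @ alpha + offset[n]      # linear predictor, length G
    return np.sum(np.exp(eta) - counts[n] * eta)

# Minimize separately for each cell n, then map back to the simplex.
omega_new = np.empty((N, K))
for n in range(N):
    res = minimize(objective, x0=np.zeros(K), args=(n,), method="BFGS")
    omega_new[n] = softmax(res.x)
```

Because the objective separates over cells, the joint argmin over all \(v_{n,\star}\) reduces to one small unconstrained minimization per cell.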

We use the same simulation set-up as the one presented in the previous R chunk (with the batch effect included).

K=4;
G=100;
N=500;

alpha_true=matrix(rnorm((K)*G,0.5),nrow=(K)); ### the matrix of fixed effects

Label.Batch=c(rep(1,N/2),rep(2,N/2));

B=max(Label.Batch);

sigmab_true=2;

beta_true=matrix(0,B,G);       ###  the matrix of the random effect

for(g in 1:G)
{
  beta_true[,g]=rnorm(B,mean=0,sd=sigmab_true);
}

library(gtools)
T=10;
omega_true=matrix(rbind(rdirichlet(T*10,c(3,4,2,6)),rdirichlet(T*10,c(1,4,6,3)),
            rdirichlet(T*10,c(4,1,2,2)),rdirichlet(T*10,c(2,6,3,2)),
            rdirichlet(T*10,c(3,3,5,4))), nrow=N);


over_dis=0.5;

noise_true=matrix(0,N,G);

for(n in 1:N)
{
    noise_true[n,]=over_dis*rnorm(G,0,1);
}




###  generating the table 


read_counts=matrix(0,N,G);



for(n in 1:N)
{
    for(g in 1:G)
    {

        mean=exp(omega_true[n,]%*%alpha_true[,g] +beta_true[Label.Batch[n],g]+noise_true[n,g]);
        read_counts[n,g]=rpois(1,mean);
    }
}

The true topic proportion weights, the estimated topic proportions obtained from fitting the usual topic model due to Matt Taddy, and those obtained from this mechanism (which accounts for the random batch effect) are presented below in the form of Structure plots.

  • True Proportion Plot (Overdispersion std error = 1)
  • Topic Model Proportion Plot (Overdispersion std error = 1)
  • Estimated Proportion Plot (Overdispersion std error = 1)
  • True Proportion Plot (Overdispersion std error = 0.5)
  • Topic Model Proportion Plot (Overdispersion std error = 0.5)
  • Estimated Proportion Plot (Overdispersion std error = 0.5)

Poisson Gene Level Model

In this model, we assume the following distribution for \(c_{ng}\)’s

\[ c_{ng} \sim \sum_{k=1}^{K} \omega_{nk} Poi(\nu_{b(n):k}(g)) \]

where

\[ \nu_{b(n):k} (g) = exp(\alpha_{kg}+\beta_{b(n):g} +\epsilon_{ng}) \]

For the initialization, we noticed that the most important terms in this case are the \(\alpha_{kg}\)’s. Running multiple simulations, we observed that when the \(\alpha_{kg}\)’s are close to the true values the convergence is good, and otherwise it leads to problems. So we mainly focused on the initialization of the \(\alpha_{kg}\)’s.

Initialization

We fit a Normal Topic Model (NTM) to \(log(c_{ng}+10^{-7})\), that is, the model

\[ log(c_{ng}+10^{-7}) = \sum_{k=1}^{K} \omega_{nk}\alpha_{kg} + \beta_{b(n):g}+e_{ng} \]

We run this for 100 iterations and obtain preliminary estimates \(\hat{\omega}_{nk}\) and \(\hat{\alpha}_{kg}\). Once we obtain these estimates, we do the following to refine them further.

The posterior probability that the count \(c_{ng}\) comes from the \(k\)th cluster is given by

\[P(Z_{ng}=k|c_{ng}) = \frac{\omega_{nk}Poi(c_{ng},exp(\alpha_{kg}+\beta_{b(n):g}+e_{ng}))}{\sum_{l=1}^{K} \omega_{nl}Poi(c_{ng},exp(\alpha_{lg}+\beta_{b(n):g}+e_{ng}))} \]

Then, for each \((n,g)\)th entry of the matrix, we simulate one \(Z_{ng}\) from the above posterior probability, evaluated at the preliminary estimates obtained in the previous step.
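A minimal sketch of this sampling step (Python/NumPy with scipy rather than R; all names and sizes are illustrative, and the Poisson is parameterized by its mean \(exp(\alpha_{kg}+\beta_{b(n):g}+e_{ng})\), consistent with the model definition):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
N, G, K = 4, 6, 3                           # small illustrative sizes
omega = rng.dirichlet(np.ones(K), size=N)   # stand-in for the preliminary omega-hat
alpha = rng.normal(size=(K, G))             # stand-in for the preliminary alpha-hat
offset = rng.normal(scale=0.3, size=(N, G)) # stand-in for beta_{b(n):g} + e_{ng}
counts = rng.poisson(2.0, size=(N, G))

Z = np.empty((N, G), dtype=int)
for n in range(N):
    for g in range(G):
        # Unnormalized posterior P(Z_ng = k | c_ng) over the K clusters,
        # with the Poisson mean exp(alpha + beta + e) for each k.
        lik = poisson.pmf(counts[n, g], np.exp(alpha[:, g] + offset[n, g]))
        post = omega[n] * lik
        post /= post.sum()
        Z[n, g] = rng.choice(K, p=post)     # one simulated cluster label
```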

Then, given the matrix of topic labels \(Z_{ng}\), we fit the following model via the glmer function, with the counts \(c_{ng}\) as the response:

\[ c_{ng} \sim Poi(exp(\alpha_{Z_{ng}:g}+\beta_{b(n):g} +e_{ng})) \]

This will again give us some refined estimates of \(\alpha_{kg}\) which will be used as the initial values in the actual estimation step.

Estimation

In the actual estimation step, we follow an iterative EM algorithm. Suppose we are at iteration \(m\). We then define

\[ Pr[Z_{ng}=k | c_{ng}] = \frac{\omega^{(m)}_{nk}Poi(c_{ng},exp(\alpha^{(m)}_{kg}+\beta^{(m)}_{b(n):g}+e^{(m)}_{ng}))}{\sum_{l=1}^{K} {\omega^{(m)}_{nl}Poi(c_{ng},exp(\alpha^{(m)}_{lg}+\beta^{(m)}_{b(n):g}+e^{(m)}_{ng}))}} \]

\[ \omega^{(m+1)}_{nk} = \frac{1}{G} \sum_{g=1}^{G} Pr[Z_{ng}=k | c_{ng}] \]

Given the estimates \(\beta^{(m)}_{b(n):g}\), one can show using Jensen’s inequality on the log likelihood term that the \((m+1)\)th iterate for the \(\alpha\)’s would be

\[ \alpha^{(m+1)}_{kg} = log \left (\frac{\sum_{n=1}^{N} \omega^{(m)}_{nk} c_{ng}}{\sum_{n=1}^{N} \omega^{(m)}_{nk} exp(\beta^{(m)}_{b(n):g}+e^{(m)}_{ng}) }\right ) \]
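The E-step responsibilities and the two closed-form updates above can be sketched as follows (Python/NumPy with scipy, not R; sizes and names are illustrative, and we add 1 to the simulated counts purely to keep every count positive in this toy example so the logarithm is finite):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)
N, G, K = 6, 8, 3                               # small illustrative sizes
omega = rng.dirichlet(np.ones(K), size=N)       # omega^(m)
alpha = rng.normal(size=(K, G))                 # alpha^(m)
offset = rng.normal(scale=0.3, size=(N, G))     # beta^(m)_{b(n):g} + e^(m)_{ng}
counts = rng.poisson(2.0, size=(N, G)) + 1      # +1 keeps every count positive here

# E-step: responsibilities r[n, k, g] = Pr[Z_ng = k | c_ng]
means = np.exp(alpha[None, :, :] + offset[:, None, :])  # shape (N, K, G)
lik = poisson.pmf(counts[:, None, :], means)            # shape (N, K, G)
r = omega[:, :, None] * lik
r /= r.sum(axis=1, keepdims=True)

# omega update: average the responsibilities over genes
omega_new = r.mean(axis=2)

# Closed-form alpha update from the displayed formula
num = np.einsum('nk,ng->kg', omega, counts)
den = np.einsum('nk,ng->kg', omega, np.exp(offset))
alpha_new = np.log(num / den)
```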

Then, to obtain the BLUPs for the \(\beta\)’s and the \(e\)’s, we minimize the following quantity (equivalent to posterior maximization with respect to the random effects):

\[ \beta^{(m+1)} = \underset{\beta}{argmin} \left [ -log L(\omega^{(m+1)}, \alpha^{(m+1)}, \beta, e^{(m)}) + \sum_{b=1}^{B}\sum_{g=1}^{G} \beta^2_{bg} \right] \]

Similarly, for \(e\) we minimize

\[ e^{(m+1)} = \underset{e}{argmin} \left [-log L(\omega^{(m+1)}, \alpha^{(m+1)}, \beta^{(m+1)}, e) + \sum_{n=1}^{N}\sum_{g=1}^{G} e^2_{ng} \right ] \]

We find that this approach is quite fast, since we have exact expressions for \(\alpha\) and \(\omega\) that are cheap to evaluate. This approach is known as the penalized log-likelihood method.
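A sketch of the penalized log-likelihood minimization for \(\beta\) (the \(e\) update is analogous), again in Python with scipy rather than R, with illustrative sizes and the unweighted ridge penalty exactly as written above:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, G, K, B = 6, 5, 3, 2                       # small illustrative sizes
batch = np.array([0, 0, 0, 1, 1, 1])          # batch label b(n) for each cell
omega = rng.dirichlet(np.ones(K), size=N)     # stand-in for omega^(m+1)
alpha = rng.normal(size=(K, G))               # stand-in for alpha^(m+1)
e = rng.normal(scale=0.2, size=(N, G))        # stand-in for e^(m)
counts = rng.poisson(2.0, size=(N, G))

def penalized_nll(beta_flat):
    beta = beta_flat.reshape(B, G)
    eta = omega @ alpha + beta[batch] + e     # linear predictor, shape (N, G)
    nll = np.sum(np.exp(eta) - counts * eta)  # Poisson NLL up to a constant
    return nll + np.sum(beta ** 2)            # plus the ridge penalty on beta

res = minimize(penalized_nll, np.zeros(B * G), method="L-BFGS-B")
beta_hat = res.x.reshape(B, G)                # beta^(m+1)
```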

The simulation set up for this model is presented in the R chunk below

K=4;
G=100;
N=500;

alpha_true=matrix(rnorm((K)*G,1,1),nrow=(K)); ### the matrix of fixed effects

Label.Batch=c(rep(1,N/2),rep(2,N/2));

B=max(Label.Batch);

sigmab_true=1;

beta_true=matrix(0,B,G);       ###  the matrix of the random effect

for(g in 1:G)
{
  beta_true[,g]=rnorm(B,mean=0,sd=sigmab_true);
}

library(gtools)
T=10;
omega_true=matrix(rbind(rdirichlet(T*10,c(3,4,2,6)),rdirichlet(T*10,c(1,4,6,3)),
            rdirichlet(T*10,c(4,1,2,2)),rdirichlet(T*10,c(2,6,3,2)),
            rdirichlet(T*10,c(3,3,5,4))), nrow=N);



###  generating the table 



over_dis=0.5;

noise_true=matrix(0,N,G);

for(n in 1:N)
{
    noise_true[n,]=over_dis*rnorm(G,0,1);
}


read_counts=matrix(0,N,G);
indicator=matrix(0,N,G);

for(n in 1:N)
{
    for(g in 1:G)
    {
        index=sample(1:K, 1, prob=omega_true[n,]);   ### draw one cluster label with probabilities omega_true[n,]
        mean=exp(alpha_true[index,g] +beta_true[Label.Batch[n],g]+noise_true[n,g]);
        read_counts[n,g]=rpois(1,mean);
        indicator[n,g]=index;
    }
}

The graphs for the model simulations are given below

  • True Proportion Plot (Overdispersion std error = 1)
  • Topic Model Proportion Plot (Overdispersion std error = 1)
  • Estimated Proportion Plot (Overdispersion std error = 1)
  • True Proportion Plot (Overdispersion std error = 0.5)
  • Topic Model Proportion Plot (Overdispersion std error = 0.5)
  • Estimated Proportion Plot (Overdispersion std error = 0.5)

Gene Membership

Gene membership denotes the membership probability of a gene in a particular cluster \(k\): given a gene \(g\), which cluster has what percentage representation of that gene? For the gene level model, this can be obtained probabilistically as

\[\theta_{kg} = \frac{\sum_{n=1}^{N} \omega_{nk} Poi(c_{ng},exp(\alpha_{kg} + \beta_{b(n):g} + e_{ng}))}{\sum_{n=1}^{N} \sum_{l=1}^{K} \omega_{nl} Poi(c_{ng},exp(\alpha_{lg} + \beta_{b(n):g} + e_{ng}))} \]

However, such an easily interpretable membership coefficient cannot be obtained for the reads model. So for the reads model, we define the membership coefficient to be

\[ \theta_{kg} = \frac{|\alpha_{kg}|}{\sum_{l=1}^{K}|\alpha_{lg}|} \]

This would be the proportion of absolute mean intensity of the gene \(g\) explained by the \(k\)th cluster.
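Both membership definitions can be sketched as follows (Python/NumPy with scipy rather than R; all quantities are simulated placeholders standing in for the fitted values):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(5)
N, G, K = 6, 5, 3                            # small illustrative sizes
omega = rng.dirichlet(np.ones(K), size=N)
alpha = rng.normal(size=(K, G))
offset = rng.normal(scale=0.3, size=(N, G))  # beta_{b(n):g} + e_{ng}
counts = rng.poisson(2.0, size=(N, G))

# Gene-level model: theta_kg from the posterior mixture weights
lik = poisson.pmf(counts[:, None, :],
                  np.exp(alpha[None, :, :] + offset[:, None, :]))  # (N, K, G)
weighted = omega[:, :, None] * lik
theta_gene = weighted.sum(axis=0) / weighted.sum(axis=(0, 1))      # (K, G)

# Reads model: theta_kg from the absolute fixed effects
theta_reads = np.abs(alpha) / np.abs(alpha).sum(axis=0)
```

In both cases the columns of \(\theta\) sum to one over the \(K\) clusters, so each gene's membership is a proper probability vector.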